research data


Overcoming Representation Bias in Fairness-Aware data Repair using Optimal Transport

Langbridge, Abigail, Quinn, Anthony, Shorten, Robert

arXiv.org Artificial Intelligence

Optimal transport (OT) has an important role in transforming data distributions in a manner which engenders fairness. Typically, the OT operators are learnt from the unfair, attribute-labelled data, and then used for their repair. Two significant limitations of this approach are as follows: (i) the OT operators for underrepresented subgroups are poorly learnt (i.e. they are susceptible to representation bias); and (ii) these OT repairs cannot be effected on identically distributed but out-of-sample (i.e. archival) data. In this paper, we address both of these problems by adopting a Bayesian nonparametric stopping rule for learning each attribute-labelled component of the data distribution. The induced OT-optimal quantization operators can then be used to repair the archival data. We formulate a novel definition of the fair distributional target, along with quantifiers that allow us to trade fairness against damage in the transformed data. These quantifiers reveal the excellent performance of our representation-bias-tolerant scheme on simulated and benchmark data sets.
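As a rough one-dimensional illustration (not the authors' scheme, which learns attribute-labelled components under a Bayesian nonparametric stopping rule), repairing two subgroups toward a common fair target can be sketched with quantile matching: in 1D, the OT plan between equal-size empirical distributions is the sorted matching, and the equal-weight Wasserstein barycenter's quantile function is the average of the subgroup quantile functions.

```python
def repair_to_barycenter(group_a, group_b):
    """1D OT repair sketch: move each subgroup's samples to the
    equal-weight Wasserstein barycenter of the two empirical
    distributions. Assumes equal subgroup sizes for simplicity."""
    a_sorted = sorted(group_a)
    b_sorted = sorted(group_b)
    # barycenter quantiles = average of the two quantile functions
    bary = [(x + y) / 2 for x, y in zip(a_sorted, b_sorted)]
    # each original sample moves to the barycenter value at its own rank
    rank_a = {i: r for r, i in enumerate(sorted(range(len(group_a)), key=lambda i: group_a[i]))}
    rank_b = {i: r for r, i in enumerate(sorted(range(len(group_b)), key=lambda i: group_b[i]))}
    repaired_a = [bary[rank_a[i]] for i in range(len(group_a))]
    repaired_b = [bary[rank_b[i]] for i in range(len(group_b))]
    return repaired_a, repaired_b
```

After repair, the two subgroups share one distribution, so the feature carries no information about the protected label. Note how a small subgroup yields noisy empirical quantiles, and hence a poorly learnt map; this is exactly the representation bias the paper's stopping rule is designed to tolerate.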


NFDIcore 2.0: A BFO-Compliant Ontology for Multi-Domain Research Infrastructures

Bruns, Oleksandra, Tietz, Tabea, Waitelonis, Joerg, Posthumus, Etienne, Sack, Harald

arXiv.org Artificial Intelligence

This paper presents NFDIcore 2.0, an ontology compliant with the Basic Formal Ontology (BFO) designed to represent the diverse research communities of the National Research Data Infrastructure (NFDI) in Germany. NFDIcore ensures interoperability across various research disciplines, thereby facilitating cross-domain research. Each domain's individual requirements are addressed through specific ontology modules. This paper discusses lessons learned during the ontology development and mapping process, supported by practical validation through use cases in diverse research domains. The originality of NFDIcore lies in its adherence to BFO, its use of SWRL rules for efficient knowledge discovery, and its modular, extensible design tailored to the needs of heterogeneous research domains.


Towards a Knowledge Graph for Models and Algorithms in Applied Mathematics

Schembera, Björn, Wübbeling, Frank, Kleikamp, Hendrik, Schmidt, Burkhard, Shehu, Aurela, Reidelbach, Marco, Biedinger, Christine, Fiedler, Jochen, Koprucki, Thomas, Iglezakis, Dorothea, Göddeke, Dominik

arXiv.org Artificial Intelligence

Mathematical models and algorithms are an essential part of mathematical research data, as they epistemically ground numerical data. To represent models and algorithms, as well as their relationships, semantically and thus make this research data FAIR, two previously distinct ontologies were merged and extended, becoming a living knowledge graph. The link between the two ontologies is established by introducing computational tasks, as they occur in modeling, corresponding to algorithmic tasks. Moreover, controlled vocabularies are incorporated and a new class, distinguishing base quantities from specific use-case quantities, was introduced. Also, both models and algorithms can now be enriched with metadata. Subject-specific metadata is particularly relevant here, such as the symmetry of a matrix or the linearity of a mathematical model. This is the only way to express specific workflows with concrete models and algorithms, as the feasible solution algorithm can only be determined if the mathematical properties of a model are known. We demonstrate this using two examples from different application areas of applied mathematics. In addition, we have already integrated over 250 research assets from applied mathematics into our knowledge graph.
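A toy triple store (all names here are invented for illustration, not drawn from the actual ontologies) shows the kind of linkage the abstract describes: a computational task connects a model to an algorithm, and subject-specific metadata such as symmetry is attached as a property.

```python
# Hypothetical triples: a computational task links a model to an
# algorithm, with subject-specific metadata attached as properties.
triples = [
    ("HeatEquation", "isModelFor", "HeatConductionTask"),
    ("HeatConductionTask", "solvedBy", "CrankNicolson"),
    ("CrankNicolson", "hasProperty", "implicit"),
    ("SystemMatrix", "hasProperty", "symmetric"),
]

def query(subj=None, pred=None, obj=None):
    """Match triples against an optional (subject, predicate, object) pattern."""
    return [t for t in triples
            if (subj is None or t[0] == subj)
            and (pred is None or t[1] == pred)
            and (obj is None or t[2] == obj)]
```

For example, `query(pred="solvedBy")` retrieves which algorithm solves the task, which is only answerable because the model's properties are recorded in the graph, mirroring the workflow argument above.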


Optimal Transport for Fairness: Archival Data Repair using Small Research Data Sets

Langbridge, Abigail, Quinn, Anthony, Shorten, Robert

arXiv.org Artificial Intelligence

With the advent of the AI Act and other regulations, there is now an urgent need for algorithms that repair unfairness in training data. In this paper, we define fairness in terms of conditional independence between protected attributes ($S$) and features ($X$), given unprotected attributes ($U$). We address the important setting in which torrents of archival data need to be repaired, using only a small proportion of these data, which are $S|U$-labelled (the research data). We use the latter to design optimal transport (OT)-based repair plans on interpolated supports. This allows off-sample, labelled, archival data to be repaired, subject to stationarity assumptions. It also significantly reduces the size of the supports of the OT plans, with correspondingly large savings in the cost of their design and of their sequential application to the off-sample data. We provide detailed experimental results with simulated and benchmark real data (the Adult data set). Our performance figures demonstrate effective repair -- in the sense of quenching conditional dependence -- of large quantities of off-sample, labelled (archival) data.
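The core idea of learning a transport map on a small research sample and then applying it to off-sample archival data can be sketched in one dimension (a deliberate simplification of the paper's OT plans on interpolated supports): fit a monotone quantile map on the research data, then interpolate it for archival values that were never seen during design.

```python
import bisect

def fit_quantile_map(research_src, research_tgt):
    """Learn a monotone 1D transport map from a small research sample:
    the sorted source values paired with sorted target values (the 1D
    OT matching). Returns a function applicable to off-sample data,
    valid under stationarity of the archival distribution."""
    xs = sorted(research_src)
    ys = sorted(research_tgt)

    def transport(v):
        # piecewise-linear interpolation of the learnt map,
        # clamped at the ends of the research support
        i = bisect.bisect_left(xs, v)
        if i == 0:
            return ys[0]
        if i >= len(xs):
            return ys[-1]
        x0, x1 = xs[i - 1], xs[i]
        t = (v - x0) / (x1 - x0) if x1 != x0 else 0.0
        return ys[i - 1] + t * (ys[i] - ys[i - 1])

    return transport
```

Because the map is designed once on the small research set, each archival record is then repaired by a cheap sequential lookup, which is the source of the cost savings the abstract claims.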


Grandma Karl is 27 years old -- research agenda for pseudonymization of research data

Volodina, Elena, Dobnik, Simon, Tiedemann, Therese Lindström, Vu, Xuan-Son

arXiv.org Artificial Intelligence

Accessibility of research data is critical for advances in many research fields, but textual data often cannot be shared due to the personal and sensitive information it contains, e.g. names or political opinions. The General Data Protection Regulation (GDPR) suggests pseudonymization as a solution for securing open access to research data, but we need to learn more about pseudonymization as an approach before adopting it for the manipulation of research data. This paper outlines a research agenda within pseudonymization, namely the need for studies into the effects of pseudonymization on unstructured data in relation to, e.g., readability and language assessment, as well as the effectiveness of pseudonymization as a way of protecting writer identity, while also exploring different ways of developing context-sensitive algorithms for the detection, labelling and replacement of personal information in unstructured data. The recently granted pseudonymization project, Grandma Karl is 27 years old, addresses exactly these challenges.
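A minimal detect-and-replace loop (stdlib only, with a toy name list standing in for the context-sensitive detectors the agenda calls for) illustrates why consistent replacement matters: if the same person receives the same pseudonym throughout a document, coreference, and hence readability studies, remain possible.

```python
import re

# Toy stand-in gazetteer; real systems need context-sensitive detection,
# which is precisely the research gap the agenda identifies.
NAME_PATTERN = re.compile(r"\b(Karl|Anna|Elena)\b")

def pseudonymize(text, replacements):
    """Replace each detected name with a consistent pseudonym so that
    repeated mentions of one person map to one replacement."""
    def swap(match):
        return replacements.setdefault(match.group(0), f"Person{len(replacements) + 1}")
    return NAME_PATTERN.sub(swap, text)
```

The `replacements` dictionary doubles as a (sensitive) re-identification key, which is why its handling is itself a protection question.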


Senior Faculty

#artificialintelligence

The Center for Computational Life Sciences at Cleveland Clinic offers an opportunity for a visionary Senior Faculty member to establish our biomedical research and healthcare system as a global leader in artificial intelligence (AI) research. This Senior Faculty position will shape and expand AI technologies and applications centered around biomedical science and healthcare, supported by a strong institutional commitment to build their vision. The Center for Computational Life Sciences will serve as a multi-institutional hub for collaborative research in computational life sciences and biomedicine, positioned at the intersection of modern biology, computer science, predictive modeling, simulation, and AI. Cleveland Clinic's electronic health record (EHR) is one of the largest in the world, containing data on over 7 million patient lives. Multi-omics data go back well over 20 years and comprise billions of data points.


10 Interesting Facts on Open Science: Scientific Revolution.

#artificialintelligence

The growth in the number and scale of universities throughout the world, as well as the expansion of their research endeavors as a means of enhancing their reputations and attracting both students and sponsors, is driving demand in the lucrative academic publishing sector. Publishing metrics have become the primary gauge of academic performance and the primary incentive for career progress; the maxim "publish or perish" has become the norm in many fields. As a result, the rate of scientific publishing has increased exponentially in recent decades, with output approaching 2.5 million papers per year by 2017. The proliferation of so-called "predatory" journals, which provide speedy publication without peer review or substantial editorial control, is another result of this increase in demand for publication channels. To counter the current science climate, Open Science has emerged.


AI with swarm intelligence: A novel technology for cooperative analysis of big data

#artificialintelligence

Science and medicine are becoming increasingly digital. Analyzing the resulting volumes of information -- known as "big data" -- is considered a key to better treatment options. "Medical research data are a treasure. They can play a decisive role in developing personalized therapies that are tailored to each individual more precisely than conventional treatments," said Joachim Schultze, Director of Systems Medicine at the DZNE and professor at the Life & Medical Sciences Institute (LIMES) at the University of Bonn. "It's critical for science to be able to use such data as comprehensively and from as many sources as possible."


AI with swarm intelligence learns to detect cancer, lung diseases and COVID-19

#artificialintelligence

Following a similar principle--called "swarm learning"--an international research team has trained artificial intelligence algorithms to detect blood cancer, lung diseases and COVID-19 in data stored in a decentralized fashion. This approach has an advantage over conventional methods: it inherently preserves privacy, which facilitates cross-site analysis of scientific data. Swarm learning could thus significantly promote and accelerate collaboration and information exchange in research, especially in the field of medicine. Experts from the DZNE, the University of Bonn, the information technology company Hewlett Packard Enterprise (HPE) and other research institutions report on this in the scientific journal Nature. Science and medicine are becoming increasingly digital.
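The aggregation step at the heart of such decentralized training can be sketched as parameter averaging: each site trains on its own patients and shares only model parameters, never raw data. This is a simplified federated-averaging sketch; the swarm learning reported in Nature additionally removes the central coordinator, which this sketch does not model.

```python
def swarm_merge(local_params):
    """Average model parameters from several sites. Only the
    parameter dictionaries cross site boundaries; the underlying
    patient data never leave their site."""
    n = len(local_params)
    keys = local_params[0].keys()
    return {k: sum(p[k] for p in local_params) / n for k in keys}
```

Each round, every site would train locally, submit its parameters to `swarm_merge`, and continue from the merged model, so all sites benefit from data they never see.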


Using Supervised Learning to Classify Metadata of Research Data by Discipline of Research

Weber, Tobias, Kranzlmüller, Dieter, Fromm, Michael, de Sousa, Nelson Tavares

arXiv.org Machine Learning

Automated classification of metadata of research data by their discipline(s) of research can be used in scientometric research, by repository service providers, and in the context of research data aggregation services. Openly available metadata from the DataCite index for research data were used to compile a large training and evaluation set of 609,524 records, which is published alongside this paper. These data allow classification approaches, such as tree-based models and neural networks, to be assessed reproducibly. According to our experiments with 20 base classes (multi-label classification), multi-layer perceptron models perform best with an F1-macro score of 0.760, closely followed by Long Short-Term Memory models (F1-macro score of 0.755). A possible application of the trained classification models is the quantitative analysis of trends towards interdisciplinarity of digital scholarly output, or the characterization of growth patterns of research data, stratified by discipline of research. Both applications perform at scale with the proposed models, which are available for re-use.
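The reported F1-macro metric for this multi-label setting (each record may belong to several of the 20 base classes) can be computed as follows; this is a stdlib sketch of the evaluation measure, not the authors' released models.

```python
def f1_macro(y_true, y_pred, n_classes):
    """Macro-averaged F1 for multi-label predictions, where each
    sample's labels are given as a set of class indices. F1 is
    computed per class, then averaged with equal class weight."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if c in t and c in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if c not in t and c in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if c in t and c not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes
```

Because every class counts equally regardless of its frequency, F1-macro rewards models that also perform well on rare disciplines, a sensible choice for a skewed repository index such as DataCite.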